[Day 24] 繼續關於CNV的部份

2019 iT 邦幫忙鐵人賽

DAY 24

自我挑戰組

When Bioinfo met Julia: Bioinformatician的30天Julia學習之路系列第 25 篇

2019鐵人賽 julialang bioinformatics cnv

nostalgie1211

2018-10-25 22:20:06

1254 瀏覽

分享至

Read-depth

由於這陣子在搞的東西跟CNV有非常大的關係，所以也陸續參考了不少文獻。前面我提過了絕大多數的演算法都是基於read depth這個值衍生出來的，而我那時也寫了一個閹割版的BEDtools makewindows的功能，今天我抽了點時間把前面那一步怎麼產生genomeInfo file的功能也寫好，搭配後面計算每個windows中reads數目的小工具來做一個介紹。

用`BioSequences.jl`計算每個chromosome的長度

using BioSequences.FASTA
using DataFrames
using CSV

function getChromLenFromRef(fasta::BioSequences.FASTA.Reader, output::String) 
	IDs = Array{String}(undef, 0)
	Length = Array{Int}(undef, 0)
	for record in fasta
		id = identifier(record)
		seq = FASTA.sequence(record)
		seqlen = length(seq)
		push!(IDs, id)
		push!(Length, seqlen)
	end
	outputDF = DataFrame(:Chromosome => IDs, :Length => Length)
	CSV.write(output, outputDF, delim = "\t")
end

genomefile = "human_GRCh38_p12.fa"  ## 這是從NCBI上面下載來的人類參考基因體序列
fastaInfo = FASTA.Reader(open(genomefile, "r"))
genomeinfofile = "genomeInfoFile.txt"
getChromeLenFromRef(fastaInfo, genomeinfofile)
run(`sed -i '1d' $genomeinfofile`)

用我自己寫的`makeWindowsForBED.jl`

function makeWindows(df::DataFrame, winSize::Int, windowsBED::String)
    Chr = Array{String}(undef, 1)
    Start = Array{Int}(undef, 1)
    Stop = Array{Int}(undef, 1)
	for c = 1:nrow(df)
		chr = df.chr[c]
		chrLen = df.chrlen[c]
		winNum = Int(floor(chrLen/winSize))
		restLen = chrLen % winSize
		for i = 0:winNum
			startpos = i * winSize
			if i != winNum
				endpos = startpos + winSize
			else
				endpos = startpos + restLen
			end
            push!(Chr, chr)
            push!(Start, startpos)
            push!(Stop, endpos)
		end
	end
    winDF = DataFrame(:chr => Chr, :start => Start, :stop => Stop)
    CSV.write(windowsBED, winDF, delim = "\t")
end

windowSize = 20000
windowsBED = "windows.bed"
df = CSV.read(genomeinfofile, header = ["chr", "chrlen"], delim = "\t")
makeWindows(df, windowSize, windowsBED)
run(`sed -i '1d' `)

最後使用幾個小工具把這些結果轉換成read-count-per-window的資料

countFile = "sample.count"
run(pipeline(`bam2bed < input.bam`, `awk '$5 > 35 {print $0}'`, stdout=pipeline(`bedmap --count $windowsBED -`, "$countFile")))

接下來就剩下一堆統計分析了